Skip to content

fix(tbtc/signer): gate plaintext state, zeroize FFI buffers, allow periodic reshare#4124

Merged
mswilkison merged 12 commits into
extraction/frost-signer-mirror-2026-05-26from
fix/signer-review-findings
Jul 1, 2026
Merged

fix(tbtc/signer): gate plaintext state, zeroize FFI buffers, allow periodic reshare#4124
mswilkison merged 12 commits into
extraction/frost-signer-mirror-2026-05-26from
fix/signer-review-findings

Conversation

@mswilkison

Copy link
Copy Markdown
Contributor

Summary

Three findings from the #4005 review (the multi-agent custody review + Codex's review), fixed and verified.

1. [HIGH] Plaintext state fallback was accepted unconditionally — persistence.rs

decode_persisted_state_storage_format fell back to parsing an unencrypted, unauthenticated PersistedEngineState whenever the bytes weren't a valid AEAD envelope — with no gate at all. The AEAD envelope is the only integrity control for state-at-rest, so anyone who could write the state file could forge it (cleared consumed_* replay markers → resume-into-double-sign, or an attacker dkg_key_package → admission bypass) without holding the state-encryption key.

Per the secret-material hardening plan, the plaintext path is now emergency-rollback-only: compile-time disabled in release builds (cfg!(debug_assertions)), rejected in production profiles, and gated behind an explicit TBTC_SIGNER_PERMIT_PLAINTEXT_STATE_ROLLBACK opt-in. The two tests that load plaintext (migration + schema-mismatch) opt in, and a negative assertion pins the fail-closed default.

2. [Codex P1] Secret FFI buffers left unwiped on free — ffi.rs

free_buffer now zeroizes the buffer before deallocation, so secret material returned by nonce/DKG/key-package endpoints (e.g. frost_tbtc_generate_nonces_and_commitments) is wiped rather than left in allocator memory when a caller forgets to clear it. (zeroize is a volatile wipe the optimizer can't elide.)

3. [Codex P2] Periodic reshare was a one-shot — lifecycle.rs

After the first refresh_shares, any later reshare with updated current_shares had a different fingerprint and was rejected as SessionConflict, so a long-lived session could never advance refresh_history/refresh_count/cadence past the first refresh. A different fingerprint is now treated as a new periodic refresh; idempotent replay of the same fingerprint is unchanged.

Deferred (flagged for the team)

  • Signing-policy firewall production force-on (my MEDIUM finding): mirroring the provenance gate is correct in principle, but force-enabling the firewall makes its entire policy config (allowed script classes, output caps, …) mandatory in production — a deployment-contract change. The verifier flagged it as needing product confirmation, so it's left as-is pending that decision.
  • State-at-rest anti-rollback freshness counter and quarantine-reset replay-marker loss: design-level (needs an external monotonic anchor) / inherent to that opt-in policy (default is fail-closed).

Verification

  • cargo test297 passed, 0 failed.
  • cargo clippy --all-targets -- -D warningsclean (exit 0).

Co-Authored-By: Claude Opus 4.8 (1M context) noreply@anthropic.com

🤖 Generated with Claude Code

…riodic reshare

Three findings from the #4005 review (workflow review + Codex):

1. [HIGH] persistence: the legacy unencrypted plaintext state fallback was
   accepted unconditionally, bypassing the AEAD envelope -- anyone who could
   write the state file could forge it (cleared replay markers / attacker key
   material) without the state-encryption key. Per the secret-material hardening
   plan the plaintext path is now emergency-rollback-only: compile-time disabled
   in release builds, rejected in production profiles, and gated behind an
   explicit TBTC_SIGNER_PERMIT_PLAINTEXT_STATE_ROLLBACK opt-in. The two tests
   that load plaintext now opt in; a negative assertion pins the fail-closed
   default.

2. [Codex P1] ffi: free_buffer now zeroizes the buffer before deallocation, so
   secret material returned by nonce/DKG/key-package endpoints is wiped rather
   than left in allocator memory when a caller forgets to clear it.

3. [Codex P2] lifecycle: a subsequent same-session refresh with updated shares
   produced a different fingerprint and was rejected as SessionConflict, so a
   long-lived session could never advance refresh_history past the first refresh.
   A different fingerprint is now treated as a new periodic refresh; idempotent
   replay of the same fingerprint is unchanged.

Deferred (flagged, not in this change):
- Signing-policy firewall production force-on: force-enabling it makes the entire
  firewall policy config mandatory in production (a deployment-contract change);
  the verifier flagged it as needing product confirmation, so left as-is.
- State-at-rest anti-rollback freshness counter and quarantine-reset marker loss:
  design-level / inherent to the opt-in policy.

Verified: cargo test (297 passed, 0 failed) and cargo clippy -D warnings clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented Jun 30, 2026

Copy link
Copy Markdown

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: d38e6243-cdfc-4c1f-92f3-2228716f365c

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/signer-review-findings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

mswilkison and others added 11 commits June 30, 2026 13:28
Follow-up to the #4124 review (multi-agent + Codex), which found regressions in
the periodic-refresh change and a gap in the FFI wipe:

1. Refresh stale-retry (CONFIRMED, Codex P2): removing SessionConflict made
   refresh re-execute on a replayed/older request -- re-deriving stale shares and
   bumping the epoch. Now a fingerprint already in refresh_history (accepted but
   no longer current) is rejected as a stale retry; a genuinely-new fingerprint
   still proceeds (repeatable periodic reshares). RefreshHistoryRecord gains an
   optional request_fingerprint (serde-default, backward compatible).

2. Unbounded refresh_history (CONFIRMED): the removed guard was the only bound.
   Cap per-session history at MAX_REFRESH_HISTORY (256), dropping oldest;
   strictly-increasing epochs preserved so continuity still holds.

3. FFI incomplete wipe (PLAUSIBLE): to_ffi_buffer's into_boxed_slice() shrinks and
   reallocates when capacity > len (serde_json over-allocates), freeing the
   original secret buffer un-wiped. Copy into an exact boxed slice and zeroize the
   source Vec so no un-wiped copy survives.

4. Quarantine schema-mismatch test exercised plaintext-refusal, not schema
   validation, after the plaintext gate landed: opt into the rollback flag like
   its sibling so it tests the intended path.

5. Plaintext-gate doc comment overclaimed "compile-time disabled in release";
   clarify the load-bearing guard is the runtime non-production check.

Verified: cargo test (298 passed) + clippy -D warnings + fmt clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Codex re-review: the rollback-path tests opt into the plaintext path, which
legacy_plaintext_state_permitted() compiles out in release (cfg!(debug_assertions)
is false), so under `cargo test --release` they fail at the post-acceptance
assertions. cfg-gate the three plaintext-acceptance tests to debug_assertions;
release always refuses plaintext before those assertions, so there is nothing for
them to cover there.

Verified: cargo test (298 passed, debug) + cargo test --release (295 passed; the 3
tests correctly compiled out) + clippy --release -D warnings clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Codex re-review: a host that installs an init-time config has signer_env_var read
the installed config, not the process environment, so the documented
TBTC_SIGNER_PERMIT_PLAINTEXT_STATE_ROLLBACK env flag had no effect for it -- the
emergency plaintext migration was impossible to enable via the configured-host
path. Add permit_plaintext_state_rollback to InitSignerConfigRequest and map it in
config_values_from_request, mirroring the other enforcement flags.

Verified: cargo test (298 passed) + clippy -D warnings + fmt clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Codex re-review: a session loaded from state written before
RefreshHistoryRecord.request_fingerprint deserializes its history records with
None, so the last accepted refresh fingerprint survives only in
session.refresh_request_fingerprint. The first post-upgrade periodic refresh
overwrote that value, so a delayed retry of the pre-upgrade refresh no longer
matched the history scan and was accepted as a new epoch instead of
SessionConflict.

Before overwriting refresh_request_fingerprint, backfill the previously-accepted
fingerprint onto the most-recent history record when it is not already tracked
(the legacy None case). Adds a regression test that simulates pre-upgrade state
(strip history fingerprints) and verifies a retry of the old refresh is rejected
after a new one.

Verified: cargo test (299 debug / 296 release) + clippy -D warnings + fmt clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Codex re-review: refresh_cadence_status reported refresh_count as
session.refresh_history.len(), which the MAX_REFRESH_HISTORY cap bounds -- so a
long-lived session that ran more than 256 refreshes showed a stuck count of 256
even as later refreshes succeeded.

Add a monotonic SessionState.refresh_count (and its PersistedSessionState mirror,
serde-default for back-compat, plus both conversions) incremented on each accepted
refresh and backfilled from the retained history length for legacy state; report
it from refresh_cadence_status instead of the pruned history length. Adds a
regression test that prunes history and asserts the reported count is the true
total.

Verified: cargo test (300 passed) + clippy -D warnings + fmt clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… load

Codex re-review: refresh_count is #[serde(default)], so state written before the
field existed deserializes it to 0 even when refresh_history holds accepted
refreshes -- and since refresh_cadence_status now reports refresh_count, an
upgraded node reported 0 until the next refresh. Backfill it from
refresh_history.len() in the Persisted->Session conversion (evaluated before
refresh_history is moved), with a regression test.

Verified: cargo test + clippy -D warnings + fmt clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ntine recheck)

Codex re-review of the signer FFI surface:

1. [P1] Secret request fields not zeroized (api.rs): SignShareRequest.nonces_hex
   and key_package_hex deserialized into plain Strings dropped without
   zeroization, retaining the one-time nonce / private key package in freed heap.
   Wrap both in Zeroizing<String> (wiped on drop) + a redacting Debug so the
   secrets no longer leak via {:?} either.

2. [P1] Panic hook defeats payload redaction (ffi.rs): catch_unwind does not
   suppress Rust's default panic hook, so a panic payload (path/config/secret)
   prints to stderr before being converted to the redacted ErrorResponse. Install
   a panic hook at the FFI boundary that redacts the payload outside development.

3. [P2] Quarantine not rechecked on cached-round reuse (signing.rs): the matched-
   fingerprint reuse branch returned a fresh share without the new-round path's
   enforce_not_quarantined_identifiers check, so a participant quarantined after
   the round opened could still obtain a share. Apply the check on reuse too.

Verified: cargo test (301 passed) + clippy -D warnings + fmt clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…pty history

Codex re-review: legacy state written before refresh_history existed can carry
refresh_request_fingerprint/refresh_result with an EMPTY refresh_history. The
fingerprint backfill used refresh_history.last_mut(), which is None for empty
history, so the previous fingerprint was not recorded and a delayed retry of that
pre-upgrade refresh was re-executed as a new refresh.

When there is no record to backfill onto, synthesize one from the cached
refresh_result (prior epoch, keeping history strictly increasing) carrying the
previous fingerprint, so stale retries stay rejected. Adds a regression test for
the empty-history legacy shape.

Verified: cargo test (302 passed) + clippy -D warnings + fmt clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…h env

Two Codex-flagged issues in the non-interactive signing path:

1. start_sign_round fingerprinted and derived the round id from the raw
   request.message_hex string. hex::decode accepts mixed casing, so two calls
   carrying identical message bytes but different hex casing derived different
   fingerprints/round ids and a semantically identical retry was rejected as a
   SessionConflict. Lowercase message_hex right after it validates as hex,
   mirroring the interactive path (interactive.rs), so the fingerprint and
   derive_round_id are casing-independent.

2. The README-documented `cargo bench --features bench-restart-hook --bench
   phase5_roast` failed in a clean shell: ensure_benchmark_environment never set
   TBTC_SIGNER_PROFILE (missing -> production) or TBTC_SIGNER_STATE_ENCRYPTION_KEY_HEX
   (required by the default `env` state-key provider), so the first RunDkg
   persist aborted. Seed profile=development, provider=env, and a fixed dev key.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The prior commit lowercases message_hex before computing every canonical and
legacy fingerprint. That regressed the upgrade path: when a node upgrades with
an active sign round that a previous build started using mixed/uppercase
message_hex, the persisted sign_request_fingerprint was computed over that
original casing. The same retry then matched none of the fingerprints and fell
through to SessionConflict instead of reusing the cached round.

Capture the pre-canonicalization casing and, only when it differs from the
lowercased form, compute the four legacy fingerprint shapes over the original
casing and accept them in the cached-round match. A match migrates the stored
fingerprint to the canonical lowercase form, so the compatibility cost is paid
once per in-flight round. Adds a regression test that persists a mixed-case
fingerprint and asserts the retry reuses the round and migrates it.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…d result

The empty-history synthesize branch was guarded by `else if let Some(previous_result)`,
so a legacy/degraded state that kept refresh_request_fingerprint but whose
refresh_result deserialized to None (and whose refresh_history is empty) fell
through: neither the last_mut() backfill nor the synthesize branch ran. The next
refresh then overwrote the only copy of the old fingerprint, so a delayed retry
of the pre-upgrade request was treated as fresh and advanced the epoch — a replay.

Make the branch unconditional: synthesize a history record carrying the previous
fingerprint whether or not a cached result is present. Prefer the result's epoch
(when non-zero and below the new refresh) and share count; otherwise fall back to
refresh_epoch-1 (min 1) and a zero share count. refresh_epoch_counter is
persisted, so a prior accepted refresh implies refresh_epoch >= 2 and the fallback
stays non-zero and strictly below the new record, keeping history monotonic.

Adds a regression test (empty history + refresh_result=None) verified to fail
against the pre-fix code (the stale retry was re-executed instead of rejected).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@mswilkison mswilkison merged commit 01e274b into extraction/frost-signer-mirror-2026-05-26 Jul 1, 2026
20 checks passed
@mswilkison mswilkison deleted the fix/signer-review-findings branch July 1, 2026 01:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant